Facebook-owned sites were down (facebook.com)
2583 points by nabeards 1 day ago | 1295 comments





There's still no connectivity to Facebook's DNS servers:

    > traceroute a.ns.facebook.com
      traceroute to a.ns.facebook.com (129.134.30.12), 30 hops max, 60 byte packets
      1  dsldevice.attlocal.net (192.168.1.254)  0.484 ms  0.474 ms  0.422 ms
      2  107-131-124-1.lightspeed.sntcca.sbcglobal.net (107.131.124.1)  1.592 ms  1.657 ms  1.607 ms 
      3  71.148.149.196 (71.148.149.196)  1.676 ms  1.697 ms  1.705 ms
      4  12.242.105.110 (12.242.105.110)  11.446 ms  11.482 ms  11.328 ms
      5  12.122.163.34 (12.122.163.34)  7.641 ms  7.668 ms  11.438 ms
      6  cr83.sj2ca.ip.att.net (12.122.158.9)  4.025 ms  3.368 ms  3.394 ms
      7  * * *
      ...
So they're hours into this outage and still haven't re-established connectivity to their own DNS servers.

"facebook.com" is registered with "registrarsafe.com" as registrar. "registrarsafe.com" is unreachable because it's using Facebook's DNS servers and is probably a unit of Facebook. "registrarsafe.com" itself is registered with "registrarsafe.com".

I'm not sure of all the implications of those circular dependencies, but it probably makes it harder to get things back up if the whole chain goes down. That's also probably why we're seeing the domain "facebook.com" for sale on domain sites. The registrar that would normally provide the ownership info is down.

Anyway, until "a.ns.facebook.com" starts working again, Facebook is dead.
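
If you want to watch for that yourself, a quick (totally unofficial) check is to query one of their nameservers directly and see whether it answers at all; while their prefixes are withdrawn this just times out:

    # Ask a.ns.facebook.com for facebook.com's A record, with a short timeout.
    # No answer = still unreachable; an answer means resolution can recover.
    dig +short +time=2 +tries=1 @129.134.30.12 facebook.com A

    # Same thing by name, which only works once the NS name itself resolves:
    dig +short +time=2 +tries=1 @a.ns.facebook.com facebook.com A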


Notes as Facebook comes back up:

"registrarsafe.com" is back up. It is, indeed, Facebook's very own registrar for Facebook's own domains. "RegistrarSEC, LLC and RegistrarSafe, LLC are ICANN-accredited registrars formed in Delaware and are wholly-owned subsidiaries of Facebook, Inc. We are not accepting retail domain name registrations." Their address is Facebook HQ in Menlo Park.

That's what you have to do to really own a domain.


Out of curiosity, I looked up how much it costs to become a registrar. Based on the ICANN site, it is $4,000 USD per year, plus variable fees and transaction fees ($0.18/yr). Does anyone have experience or insight into running a domain registrar? Curious what it would entail (aside from typical SRE-type stuff).

> transaction fees ($0.18/yr)

Wow, I had no idea it was so cheap[1] once you're a registrar. The implication is that anyone who wants to be a domain squatting tycoon should become a registrar. For an annual cost of a few thousand dollars plus $0.18 per domain name registered, you can sit on top of hundreds of thousands of domain names. Locking up one million domain names would cost you only $180,000 a year. Anytime someone searched for an unregistered domain name on your site, you could immediately register it to yourself for $0.18, take it off the market, and offer to sell it to the buyer at a much inflated price. Does ICANN have rules against this? Surely this is being done?

[1] "Transaction-based fees - these fees are assessed on each annual increment of an add, renew or a transfer transaction that has survived a related add or auto-renew grace period. This fee will be billed at USD 0.18 per transaction." as quoted from https://www.icann.org/en/system/files/files/registrar-billin...


> Surely this is being done?

Personally saw this kind of thing as early as 2001.

Never search for free domains on the registrar's site unless you are going to register them immediately. Even whois queries can trigger this kind of thing, although that mostly happens on obscure gTLD/ccTLD registries which have a single registrar for the whole TLD.


I can sadly attest to this behavior as recently as a couple years ago :(

I searched for a domain that I couldn't immediately grab (one of the more expensive kind) using a random free whois site... and when I revisited the domain several weeks later it was gone :'(

Emailed the site's new owner D: but fairly predictably got no reply.

Lesson learned, and thankfully on a domain that wasn't the absolute end of the world.

I now exclusively do all my queries via the WHOIS protocol directly. Welp.
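
For .com/.net that means asking the registry's whois server itself rather than some web front end. A rough sketch of what I do (whois.verisign-grs.com is the .com registry's whois server; the domain here is obviously made up):

    # Query the .com registry whois directly; "No match" means unregistered.
    whois -h whois.verisign-grs.com example-name-i-might-want.com

    # Or speak the protocol by hand over port 43:
    printf 'domain example-name-i-might-want.com\r\n' | nc whois.verisign-grs.com 43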


> Surely this is being done?

Probably every major retail registrar was rumored to do this at some point. Add to your calculation that even some heavyweights like GoDaddy (IIRC) tend to run ads on domains that don't have IPs specified.


Network Solutions definitely did it. I searched for a few domains along the lines of "network-solutions-is-a-scam.com", and watched them come up in WHOIS and DNS.

There are also fees you have to pay to the owner of the TLD. For example, .com has an $8.39 fee. In total that would be $8.57 per .com domain.

You are off by a factor of almost 50.
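
Back-of-the-envelope, taking the $8.39 .com registry fee above plus the $0.18 ICANN fee at face value:

    # per-domain yearly cost as a registrar vs. the ICANN fee alone
    echo '(8.39 + 0.18) / 0.18' | bc -l    # ~47.6, i.e. "almost 50"
    # so a million parked .coms runs ~$8.57M/yr, not $180k/yr
    echo '(8.39 + 0.18) * 1000000' | bc -l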


They have a pretty interesting page on the topic: https://www.icann.org/resources/pages/financials-55-2012-02-...

They want you to have $70k liquid.


And they want you to be someone other than Peter Sunde:

https://torrentfreak.com/icann-refuses-to-accredit-pirate-ba...


This is not completely accurate. The whole reason a registrar with domain abc.com can use ns1.abc.com is that glue records are established at the registry; that bootstrap keeps you out of the circular dependency. All that said, it's usually a bad idea. Someone as large as Facebook should have nameservers across zones, i.e. a.ns.fb.com, b.ns.fb.org, c.ns.fb.co, etc.
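
You can see the glue for yourself by asking one of the .com TLD servers non-recursively; the registry hands back the nameserver addresses in the additional section, which is what breaks the chicken-and-egg (a.gtld-servers.net is one of the real .com servers, the rest is just a quick sketch):

    # Ask the .com registry directly (no recursion) for facebook.com's delegation.
    # The AUTHORITY section lists a/b/c/d.ns.facebook.com, and the ADDITIONAL
    # section carries their A/AAAA glue, so a resolver never needs to resolve
    # facebook.com in order to find facebook.com's nameservers.
    dig +norecurse @a.gtld-servers.net facebook.com NS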

There is always a step that involves emailing the domain contact when a domain updates its information with the registrar. In this case, facebook.com and registrarsafe.com are managed by the same NS. You need those NS to query the MX records to send that update-approval email and unblock the registrar update. Glue records are more for performance than for breaking that loop. I'm maybe missing something, but hopefully they won't need to send an email to fix this issue.

I have literally never once received an email to confirm a domain change. Perhaps the only exception is on a transfer to another registrar (though I can't recall that occurring, either).

To be fair, we did have to get an email from EURid recently for a transfer auth code, but that was only because our registrar was not willing to provide it.

In any case, no, they will not need to send an email to fix this issue.


I just changed the email address on all my domains. My inbox got flooded with emails across three different domain vendors. If they didn't do it before, they sure are doing it now.

Yes I meant for transferring to another DNS server. In this case, they can't.

This is not true when you're the registrar (as in this case). In fact, your entire system could be down and you'd still have access to the registry's system to do this update.

FB is running their own registrar. Supposedly they can sidestep the email procedure if it's even there to begin with.

Facebook does operate their own private registrar, since they operate tens of thousands of domains. Most of these are misspellings, country-specific variants, and so forth.

So yes, the registrar that is to blame is themselves.

Source: I know someone within the company that works in this capacity.


> That's also probably why we're seeing the domain "facebook.com" for sale on domain sites. The registrar that would normally provide the ownership info is down.

That’s not how it works. The info of whether a domain name is available is provided by the registry, not by the registrars. It’s usually done via a domain:check EPP command or via a DAS system. It’s very rare for registrar to registrar technical communication to occur.

Although the above is the clean way to do it, it’s common for registrars to just perform a dig on a domain name to check if it’s available because it’s faster and usually correct. In this case, it wasn’t.
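
Roughly the difference between these two checks. During the outage the first went dark (SERVFAIL / no answer) while the registry kept reporting the name as registered; this is only a sketch, and exact behaviour varies by TLD:

    # "Dirty" availability check: does the name resolve / have reachable nameservers?
    # This is what fooled the availability checkers during the outage.
    dig +short facebook.com NS

    # Registry-side check: the .com whois still listed the registration
    # (and RegistrarSafe) the whole time.
    whois -h whois.verisign-grs.com facebook.com | head -n 20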


When the NS hostname is dependent on the domain it serves, "glue records" cover the resolution to the NS IP addresses. So there's no circular dependency type issue

Good catch. Hopefully, they won't need an email sent to fb.com from registrarsafe.com to update an important record to fix this. What a loop.

It's partially there. C and D are still not in the global tables according to routeviews, i.e. 185.89.219.12 is still not being advertised to anyone. My peers to them in Toronto have routes from them, but I'm not sure how far they are supposed to go inside their network. (Past hop 2 is them.)

    % traceroute -q1 -I a.ns.facebook.com
      traceroute to a.ns.facebook.com (129.134.30.12), 64 hops max, 48 byte packets
      1  torix-core1-10G (67.43.129.248)  0.133 ms
      2  facebook-a.ip4.torontointernetxchange.net (206.108.35.2)  1.317 ms
      3  157.240.43.214 (157.240.43.214)  1.209 ms
      4  129.134.50.206 (129.134.50.206)  15.604 ms
      5  129.134.98.134 (129.134.98.134)  21.716 ms
      6  *
      7  *

    % traceroute6 -q1 -I a.ns.facebook.com
      traceroute6 to a.ns.facebook.com (2a03:2880:f0fc:c:face:b00c:0:35) from 2607:f3e0:0:80::290, 64 hops max, 20 byte packets
      1  toronto-torix-6  0.146 ms
      2  facebook-a.ip6.torontointernetxchange.net  17.860 ms
      3  2620:0:1cff:dead:beef::2154  9.237 ms
      4  2620:0:1cff:dead:beef::d7c  16.721 ms
      5  2620:0:1cff:dead:beef::3b4  17.067 ms
      6  *
      7  *
      8  *
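
If you don't have your own peering or routeviews feed to check against, RIPEstat's public API gives a rough view of global visibility (endpoint and field names from memory, so treat this as a sketch and eyeball the JSON yourself):

    # Does anyone in the default-free zone currently see a route covering this IP?
    # Zero visibility / not announced = still withdrawn.
    curl -s 'https://stat.ripe.net/data/routing-status/data.json?resource=185.89.219.12' \
        | python3 -m json.tool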


Kevin Beaumont:

   »The Facebook outage has another major impact: lots of mobile apps constantly poll Facebook in the background = everybody is being slammed who runs large scale DNS, so knock on impacts elsewhere the long this goes on.«

https://twitter.com/GossiTheDog/status/1445118907187175427

Oh my gosh, their IPv6 address contains "face:b00c"...

> 2a03:2880:f0fc:c:face:b00c:0:35


Besides being fun and quirky, it is actually useful for their sysadmins as well as sysadmins at other orgs.

Well at least it will in 2036, when IPv6 goes mainstream.


How difficult is it to get such a "vanity" address?

You just need to get a large enough block so that you can throw most of it away by adding your own vanity part to the prefix you are given. IPv6 really isn't scarce so you can actually do that.

The face:b00c part is in the Interface ID, so this did not even need a large block (Though I am sure they have one).

dead beef sounds about right

My suspicion is that since a lot of internal comms runs through the FB domain, and since everyone is still WFH, it's probably a massive issue just to get people talking to each other to solve the problem.

I don’t know how true it is but a few reports claim employees can’t get into the building with their badges.

I remember my first time having a meeting at Facebook and observing none of the doors had keyholes and thinking "hope their badge system never goes down"

> I remember my first time having a meeting at Facebook and observing none of the doors had keyholes and thinking "hope their badge system never goes down"

Every internet-connected physical system needs to have a sensible offline fallback mode. They should have had physical keys, or at least some kind of offline RFID validation (e.g. continue to validate the last N badges that had previously successfully validated).


In case of emergency, break glass...

...the doors are glass right?


Zuck's personal conference room has 3 glass walls, so I've been amusing myself imagining him just throwing a chair through one of the walls.

That glass is bullet resistant.

Do they (you?) call him that at FB?

Yes, "Zuck".

I don't think he has the strength.


I'm assuming someone in building security has watched the end of Ex Machina...and applied some learnings, or not.

All doors are glass with the right combination of a halligan bar, an axe and a gasoline powered saw.

And I guess beyond that point, walls are glass. Or you need explosives.


Aaaaaaand it's down!

maybe they're open by default, like old 7-11 stores when they went 24hrs and had no locks on the doors :)

Breaking the glass to get in to fix the service is totally a good business move.

A few hundred bucks of glass vs. a billion wiped off the share price if the service is down for a day and all the users go find alternatives.


Link to such claims here: https://news.ycombinator.com/item?id=28750894

I have no doubt that the publicly published post-mortem report (if there even is one) will be heavily redacted in comparison to the internal-only version. But I very much want to see said hypothetical report anyway. This kind of infrastructural stuff fascinates me. And I would hope there would be some lessons in said report that even small time operators such as myself would do well to heed.


I think the real take away is that no one has this figured out.

A small company has to keep all of its customers happy (or at least be responsive when issues arise, at a bare minimum).

Massive companies deal in error budgets, where a fraction of a percent can still represent millions of users.



I guess they didn't have an "emergency ingress" plan.

Then they will have to old-school it and try a brick.

I've heard on Blind this is unrelated, more of a Covid restriction issue.

What is Blind? Or shouldn't I ask?

www.teamblind.com

Enjoy.


A copy of Glassdoor

More like a crossover between Glassdoor... and Gab.

first rule of Blind, never talk about Blind

You mean the same problem as when GMail goes down and Googlers can't reach each other?

I guess good decentralized public communication services could solve those issues for everybody.


Googler here - my opinions are my own, not representing the company

at the lowest level, in case of a severe outage, we resort to IRC, Plain Old Telephone Service and, sometimes, sticky notes taped to windows...


Around here we use Slack for primary communications, Google Hangouts (or Chat or whatever they call it now) as secondary, and we keep an on-call list with phone numbers in our main Git repo, so everyone has it checked out on their laptop, so if the SHTF, we can resort to voice and/or SMS.

I remembered to publish my cell phone's real number on the on-call list rather than just my Google Voice number since if Hangouts is down, Google Voice might be too.
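
The on-call list part is nothing fancy, basically a flat text file in the repo, so even with DNS and chat down it boils down to something like this (the path and file name here are made up for illustration):

    # Grab the on-call contacts straight out of the local clone;
    # no network, DNS, or wiki needed.
    git -C ~/src/ops show HEAD:oncall/contacts.txt | grep -i alice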


Where are the tapes though? Colo on separate tectonic tape or nah?

?

I think texasbigdata is talking about backup tapes and maybe mistyped tectonic plate

Backup tapes and in-production servers are kept at different colocation sites to protect data from fire and other catastrophes of that level.

Using colo sites on separate tectonic plates would protect you from catastrophes on a geological cataclysm level


We don't use tapes, everything we have is in the cloud, at a minimum everything is spread over multiple datacenters (AZ's in AWS parlance), important stuff is spread over multiple regions, or depending on the data, multiple cloud providers.

Last time I used tape, we used Ironmountain to haul the tapes 60 miles away which was determined to be far enough for seismic safety, but that was over a decade ago.


Thank you kind sir.

Some people here say their fallback IRC doesn't work due to DNS reliance. :|

One of my employers once forced all the staff to use an internally-developed messenger (for sake of security, but some politics was involved as well), but made an exception for the devops team who used Telegram.

Telegram? Interesting choice!

Devops like Telegram because it has proper bot API, unlike many other competitors.

Oh! It makes sense. While I don't like telegram for some reasons, their API is totally top notch and a real pleasure to work with.

That would completely defeat the purpose... I have a hard time believing that.

Why? Even if it's not DNS reliance, if they self-hosted the server (very likely) then it'll be just as unreachable as everything else within their network at the moment.

The entire purpose of an IRC backup is in case shit hits the fan. That means having it run on a completely separate stack.

What use is it if it runs on the same stack as what you might be trying to fix?


Clearly "our entire network is down, worldwide" wasn't part of their planning. Don't get too cocky with your 20/20 hindsight.

I don't think it's cocky or 20/20 hindsight. Companies I've worked for specifically set up IRC in part because "our entire network is down, worldwide" can happen and you need a way to communicate.

I bet they never tested taking out their own DNS.

IRC does use DNS at least to get hostnames during connection. I'd be surprised if it didn't use it at other points.


I've set up hosts files in case DNS was down to access critical systems before. It's a perfectly reasonable precaution.

My small org, with maybe 50 IPs/hosts we care about, still maintains a hosts file for those nodes' public and internal names. It's in Git, spread around, and we also have our fingers crossed.
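
For anyone wanting to copy the idea, it's about this exciting (names and addresses below are made up):

    # /etc/hosts entries for the handful of hosts that matter when
    # DNS itself is the thing that's broken.
    203.0.113.10   bastion.example.internal  bastion
    203.0.113.20   irc.example.internal      irc
    203.0.113.30   git.example.internal      git

    # quick check that the static entry is actually being used:
    getent hosts irc.example.internal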

If only IRC had been built with multi-server setups in mind, forwarding messages between servers and continuing to work if a single server, or even a set of them, went down, just resulting in a netsplit... Oh wait, it was!

My bet is, FB will reach out to others in FAMANG, and an interest group will form maintaining such an emergency infrastructure comm network. Basically a network for network engineers. Because media (and shareholders) will soon ask Microsoft and Google what their plans for such situations are. I'm very glad FB is not in the cloud business...


> If only IRC had been built with multi-server setups in mind, forwarding messages between servers and continuing to work if a single server, or even a set of them, went down, just resulting in a netsplit... Oh wait, it was!

yeah if only Facebook's production engineering team had hired a team of full time IRCops for their emergency fallback network...


Considering how much IRCops were paid back in the day (mostly nothing, as they were volunteers) and what a single senior engineer at FB makes, I'm sure you will find 3-4 people spread around the world willing to share that 250k+ salary among them.

That is called outbound network :)

I worked on the identity system that chat (whatever the current name is) and gmail depend on and we used IRC since if we relied on the system we support we wouldn’t be able to fix it.

Word is that the last time Google had a failure involving a cyclical dependency they had to rip open a safe. It contained the backup password to the system that stored the safe combination.

The safe in question contained a smartcard required to boot an HSM. The safe combination was stored in a secret manager that depended on that HSM.

The engineer attempted to restart the service, but did not know that a restart required a hardware security module (HSM) smart card. These smart cards were stored in multiple safes in different Google offices across the globe, but not in New York City, where the on-call engineer was located. When the service failed to restart, the engineer contacted a colleague in Australia to retrieve a smart card. To their great dismay, the engineer in Australia could not open the safe because the combination was stored in the now-offline password manager.

Source: Chapter 1 of "Building Secure and Reliable Systems" (https://sre.google/static/pdf/building_secure_and_reliable_s... size warning: 9 MB)


Lovely.

Safes typically have the instructions on how to change the combination glued to the inside of the door, and ending with something like "store the combination securely. Not inside the safe!"

But as they say: make something foolproof and nature will create a better fool.


I'm sure this sort of thing won't be a problem for a company whose founding ethos is 'move fast and break things.' O:-)

Anyone remember the 90s? There was this thing called the Information Superhighway, a kind of decentralised network of networks that was designed to allow robust communications without a single point of failure. I wonder what happened to that...?

Folks are still chatting here... seems to work as designed...

Aren't we still communicating on HN, even though possibly the largest network is down? Can you send email?

We are a dying breed... A few days ago my daughter asked me "will you send me the file on Whatsapp or Discord?". I replied I will send an email. She went "oh, you mean on Gmail?" :-D

Hahaha... I can relate to that. Email is synonymous with Gmail now, something that only dads and uncles use. :-)

Somehow I gotta figure out how to get kiddos interested in networking...

Setting up a Minecraft server has been a good experience for my kiddo to learn more networking.

I am going to guess it’s one of those things the techies want to get round to, but in reality there is never any chance or will to do it.

I can assure you that Google has a procedure in place for that.

I unfortunately cannot edit the parent comment anymore, but several people pointed out that I didn't back up my claim or provide any credentials, so here they are:

Google has multiple independent procedures for coordination during disasters. A global DNS outage (mentioned in https://news.ycombinator.com/item?id=28751140) was considered and has been taken into account.

I do not attempt to hide my identity here, quite the opposite: my HN profile contains my real name. Until recently, part of my job was to ensure that Google is prepared for various disastrous scenarios and that Googlers can coordinate the response independently of Google's infrastructure. I authored one of the fallback communication procedures that would likely be exercised today if Google's network experienced a global outage. Of course, Google has a whole team of fantastic human beings who are deeply involved in disaster preparedness (miss you!). I am pretty sure they are going to analyze what happened to Facebook today in light of Google's emergency plans.

While this topic is really fascinating, I am unfortunately not at liberty to disclose the details as they belong to my previous employer. But when I stumble upon factually incorrect comments on HN that I am in a position to correct, why not do that?


In future news: Waymo outage results in engineers unable to get to data center. Engineers don't even know where their servers are.

Give us the dirt on how Google does its disaster planning exercises, please! Do you do these exercises all at once or slowly over the year?

Interesting that you are asking for the dirt given that DiRT stands for Disaster and Recovery Testing, at least at Google.

Every year there is a DiRT week where hundreds of tests are run. That obviously requires a ton of planning that starts well in advance. The objective is, of course, that despite all the testing nobody outside Google notices anything special. Given the volume and intrusiveness of these tests, the DiRT team is doing quite an impressive job.

While the DiRT week is the most intense testing period, disaster preparedness is not limited to just one event per year. There are also plenty of tests conducted throughout the year, some planned centrally, some done by individual teams. That's in addition to the regular training and exercises that SRE teams are doing periodically.

If you are interested in reading more about Google's approach to disaster planning and preparedness, you may want to read the "DiRT, or how to get dirty" section of "Shrinking the time to mitigate production incidents—CRE life lessons" (https://cloud.google.com/blog/products/management-tools/shri...) and "Weathering the Unexpected" (https://queue.acm.org/detail.cfm?id=2371516).


Why not do both? ;)

Yup, they make a new chat app if the previous one is down.

Google Talk, Google Voice, Google Buzz, Google+ Messenger, Hangouts, Spaces, Allo, Hangouts Chat, and Google Messages.

At some point, they must run out of names, right?


You forgot google meet!

And Google Wave.

You forgot the chat boxes inside other apps like Google docs, Gmail, YouTube, etc.

And Google Pay, apparently.

> Yup, they make a new chat app if the previous one is down.

Continuous Deployment.


For those who don't know who he is: l9i would know this. Just clarifying that this is not an Internet nobody guessing.

He is still an anonymous dude to me.

HN Profile -> Personal Website -> LinkedIn -> Over 10 years experience as Google Site Reliability Engineer

Is the LinkedIn profile linking back to the hn account?

Security Engineer asking?

Ha, no. It just occurred to me that any random Hacker News account could link to somebody's personal account and claim authority on some subject.

Google SRE for 10 years, ending as the Principal Site Reliability Engineer (L8).

s/the//

Google has more than 1 L8 SRE.


I don't know who either he or you are, so...

I was clarifying his comment, since he didn't mention that this is not a guess, but inside knowledge.

I was not trying to establish a trust chain.

Take from it what you will.


Why does it matter if he's guessing or not?

Because, it may shock you to know, but sometimes people just go on the Internet and tell lies.

No shit Google has plans in place for outages.

But what are these plans, and are they any good... a respected industry figure whose CV includes being at Google for 10 years doesn't need to go into detail describing the IRC fallback to be believed and trusted that there is such a thing.


I've found that when I post things I learned on the job here it actually causes people to tell me I'm wrong or made it up even more often…

It's kind of amusing given that employers are usually pretty easy to deduce based on comments…

That's just an 'appeal to authority'.

No-one knows or cares who made the statement, it may as well have been 'water is wet', it was useless and adds nothing but noise.


I found a comment that was factually incorrect and I felt competent to comment on it. Regrettably, I wrote just one sentence and clicked reply without providing any credentials to back up my claim. Not that I try to hide my identity; as danhak pointed out in https://news.ycombinator.com/item?id=28751644, my full name and the URL of my personal website are only a click away.

I have replied to my initial comment to provide some additional context: https://news.ycombinator.com/edit?id=28752431. Hope that helps.


That’s…not what “appeal to authority” means.

I've read here on HN that exactly this was the issue when they had one of their bigger outages (I think it was due to some auth service failure) and Gmail didn't accept incoming mail.

A Gmail outage would be barely an inconvenience as Gmail plays a minor role in Google's disaster response.

Disclaimer: Ex-Googler who used to work on disaster reponse. Opinions are my own.


What do you think all those superfluous chat apps were for?

I think the issue there is that in exchange for solving the "one fat finger = outage" problem, you lose the ability to update the server fleet quickly or consistently.

BGP is decentralised.

LOL - score one against building out all tooling internally (a la Amazon and apparently Facebook too)

The rate at which some Amazon services have lately gone down because other AWS services went down proves that this is an unsustainable house of cards anyway.

Netflix knows how to build on top of a house of cards.

There's a joke here somewhere about how bad the final season was

Those communications are done over irc at FB for exactly this purpose.

time to start working at your mfing desk again, johnson

They supposedly can't enter facebook office right now. Their cards don't work.

Why would a system like that have to be in their online infrastructure?

For doing LDAP lookups against the corporate directory? Oh wait, LDAP configuration of course depends on DNS and DNS is kaputt...

source?


Sheera Frenkel @sheeraf Was just on phone with someone who works for FB who described employees unable to enter buildings this morning to begin to evaluate extent of outage because their badges weren’t working to access doors.

"Something went wrong. Try reloading."

its not loading for me. could you say what it said?



From the Tweet, "Was just on phone with someone who works for FB who described employees unable to enter buildings this morning to begin to evaluate extent of outage because their badges weren’t working to access doors."

"Something went wrong. Try reloading."

its not loading for me. could you say what it said?


> Was just on phone with someone who works for FB who described employees unable to enter buildings this morning to begin to evaluate extent of outage because their badges weren’t working to access doors.

https://nitter.net/sheeraf/status/1445099150316503057


Disclose.tv @disclosetv JUST IN - Facebook employees reportedly can't enter buildings to evaluate the Internet outage because their door access badges weren’t working (NYT)

What do you think will be the impact on WFH and office requirements?

Unlikely, PagerDuty was invented for this kind of thing

Oh, I'm sure everyone knows what's wrong, but how am I supposed to send an email, find a coworker's phone number, get the crisis team on video chat, etc., if all of those connections rely on the facebook domain existing?

Hence the suggestion for PagerDuty. It handles all this, because responders set their notification methods (phone, SMS, e-mail, and app) in their profiles, so that when in trouble nobody has to ask those questions and just add a person as a responder to the incident.

Yes, but Facebook is not a small company. Could PagerDuty realistically handle the scale of notifications that would be required for Facebook's operations?

PagerDuty does not solve some of the problems you would have at FB's scale, like how do you even know who to contact? And how do they log in once they know there is a problem?

Sure. As long as you plan for disaster.

The place where I worked had failure trees for every critical app and service. The goal for incident management was to triage and have an initial escalation to the right group within 15 minutes. When I left, they were at about 96% on target overall and 100% for infrastructure.


Even if it can't, it's trivial to use it for an important subset, i.e. is facebook.com down, is the NS stuff down, etc. So there is an argument to be made for still using an outside service as a fallback.

Sure, if you're...

- not arrogant
- or complacent
- haven't inadvertently acquired the company
- know your tech peers well enough to have confidence in their identity during an emergency
- do regular drills to simulate everything going wrong at once

Lots of us know what should be happening right now, but think back to the many situations we've all experienced where fallback systems turned into a nightmarish war story, then scale it up by 1000. This is a historic day, I think it's quite likely that the scale of the outage will lead to the breakup of the company because it's the Big One that people have been warning about for years.


I guarantee you that every single person at Facebook who can do anything at all about this already knows there's an issue. What would an extra notification help with?

We kind of got off topic. I was arguing that if you were concerned about internal systems being down (including your monitoring/alerting), something like PagerDuty would be fine as a backup. Even at huge scale, that backup doesn't need to watch everything.

I don’t think it’s particularly relevant to this issue with fb. I suspect they didn’t need a monitoring system to know things were going badly.


Heck of a coincidence I must say...

I can imagine this affects many other sites that use FB for authentication and tracking.

If people pay proper attention to it, this is not just an average run of the mill "site outage", and instead of checking on or worrying about backups of my FB data (Thank goodness I can afford to lose it all), I'm making popcorn...

Hopefully law makers all study up and pay close attention.

What transpires next may prove to be very interesting.


Indeed, what happened shows a good reason not to rely only on social log-in for various sites.

NYT tech reporter Sheera Frenkel gives us this update:

>Was just on phone with someone who works for FB who described employees unable to enter buildings this morning to begin to evaluate extent of outage because their badges weren’t working to access doors.

https://twitter.com/sheeraf/status/1445099150316503057


Got a good chuckle imagining a fuming Zuckerberg not being allowed into his office, thinking the world is falling apart.

Can’t get in to fix error

I just got off a short pre-interview conversation with a manager at Instagram and he had to dial in with POTS. I got the impression that things are very broken internally.

How much of modern POTS is reliant on VOIP? In Australia at least, POTS has been decommissioned entirely, but even where it's still running, I'm wondering where IP takes over?

I am guessing that most POTS is VOIP now, except for the few places with existing copper infrastructure that has not been decommissioned yet.

This person has a POTS line in their current location, and a modem, and the software stack to use it, and Instagram has POTS lines and modems and software that connect to their networks? Wow. How well do Instagram and their internal applications work over 56K?

He called on his mobile phone. As a result it was a voice-only conversation, no video.

They could have dialed in by their own cell phone though

I read that as POTUS at first and paused for a minute

What is POTS?


Plain old telephone system. Aka a phone.

Plain Old Telephone System

Looks like they misconfigured a web interface that they can't reach anymore now that they're off the net.

"anyone have a Cisco console cable lying around?"


The only one they have is serial and the company's one usb-to-serial converter is missing.

The voices, stories, announcements, photos, hopes and sorrows of millions, no, literally billions of people, and the promise that they may one day be seen and heard again now rests in the hands of Dave, the one guy who is closest to a Microcenter, owns his own car and knows how to beat the rush hour traffic and has the good sense to not forget to also buy an RS-232 cable, since those things tend to get finicky.

Great visual!

Yeah, the patch to fix BGP to reach the DNS gets sent by email to @facebook.com. Oops, no DNS to resolve the MX records to send the patch to fix the BGP routers.

Seriously? Is that how it works?

No. A network like Facebook's is vast and complicated and managed by higher-level configuration systems, not people emailing patches around.

If this issue even has to do with BGP, it's much more likely that the root of the problem is somewhere in this configuration system and that fixing it is compounded by some other issues that nobody foresaw. Huge events like this are always a perfect storm of several factors, any one or two of which would be a total no-op alone.


The Swiss cheese model of accidents. Occasionally the holes all align.

https://en.wikipedia.org/wiki/Swiss_cheese_model


The fun part of BGP is they apparently make a lot of use of it within their network, not just advertising routes externally.

https://engineering.fb.com/2021/05/13/data-center-engineerin...

(and yes, fb.com resolves)


No, the backbone of the internet is not maintained with patches sent in emails.

You are very wrong about that ;) https://lkml.org/


Clearly you and the person you replied to are talking about very different things.

I think the sub-comment is confusing the linux kernel with BGP.

In a way, the Linux kernel does power the "backbones of the internet".

There are a hell of a lot of non-Linux OSes running on core routers, but yes, in a way. However, BGP isn't done via email.

On the other hand, I and my office mate at the time negotiated the setup of a ridiculous number of BGP sessions over email, including sending configs. That was 20 years ago.

luckily not... would be absolutely terrible to have the backbone only on linux

Interoperability and a thriving ecosystem are necessities for resiliency.

Note that resiliency and efficiency are often working against each other.



I don't know. I doubt it. It's just funny to think that you need email to fix BGP, but DNS is down because of BGP, and you need DNS to send email, which needs BGP. It's a kind of chicken-and-egg problem, but at a massive scale this time.

Sheera Frenkel:

    Was just on phone with someone who works for FB who described employees unable to enter buildings this morning to begin to evaluate extent of outage because their badges weren’t working to access doors.
https://twitter.com/sheeraf/status/1445099150316503057

You'd think they'd have worked that into their DR plans for a complete P1 outage of the domain/DNS, but perhaps not, or at least they didn't add removal of BGP announcements to the mix.

Can someone explain why it is also down when trying to access it via Tor using its onion address: http://facebookwkhpilnemxj7asaniu7vnjjbiltxjqhye3mhbshg7kx5t...

Or when trying ips directly: https://www.lifewire.com/what-is-the-ip-address-of-facebook-...

I would have expected a DNS issue to not affect either of these.

I can understand the onion site being down if Facebook implemented it the way a third party would (a proxy server accessing facebook.com) instead of actually having it integrated into its infrastructure as a first-class citizen.


You can get through to a web server, but that web server uses DNS records or those routes to hit other services necessary to render the page. So the server you hit will also time out eventually and return a 500

The issue here is that this outage was a result of all the routes into their data centers being cut off (seemingly from the inside). So knowing that one of the servers in there is at IP address "1.2.3.4" doesn't help, because no-one on the outside even knows how to send a packet to that server anymore.
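
You can see the difference yourself: pinning the hostname to an IP takes DNS out of the picture entirely, but with the routes withdrawn the connection still just hangs (the IP below is a documentation placeholder, not one of theirs):

    # Bypass DNS by mapping facebook.com to a fixed front-end IP.
    # If the covering prefix isn't announced, this times out anyway,
    # because there's simply no route for the packets to take.
    curl -v --connect-timeout 5 --resolve facebook.com:443:203.0.113.50 https://facebook.com/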

routing was down _everywhere_ so tor is getting a better experience than most people by getting a 500 error

DNS is back, looks like systems are still coming online.

Yeah that's some pretty hardcore A/B testing right there.


